City College of San Francisco
MATH 108 - Foundations of Data Science
Associated Textbook Sections: 18.0 - 18.2
from datascience import *
import numpy as np
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
Interpretation by Physicians of Clinical Laboratory Results (1978)
We asked 20 house officers, 20 fourth-year medical students and 20 attending physicians, selected in 67 consecutive hallway encounters at four Harvard Medical School teaching hospitals, the following question:
If a test to detect a disease whose prevalence is 1/1000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming that you know nothing about the person's symptoms or signs?
Eleven of 60 participants, or 18%, gave the correct answer. These participants included four of 20 fourth-year students, three of 20 residents in internal medicine and four of 20 attending physicians. The most common answer, given by 27, was that [the chance that a person found to have a positive result actually has the disease] was 95%.
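All patients fall into one of 4 categories: has the disease and tests positive (true positive), has the disease and tests negative (false negative), does not have the disease and tests positive (false positive), or does not have the disease and tests negative (true negative).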
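To make the four categories concrete, the sketch below counts how many people would be expected in each one for a hypothetical population of 100,000. The population size and the 99% sensitivity are assumptions chosen for illustration (the survey question states only the prevalence and the false positive rate), and the variable names are illustrative as well.

```python
# Hypothetical population; the size is an assumption chosen to give round numbers.
population = 100_000

prevalence = 1 / 1000        # from the survey question
false_positive_rate = 0.05   # from the survey question
sensitivity = 0.99           # assumption: P(positive | disease) is not given in the question

# Split the population by disease status first, then by test result.
diseased = round(population * prevalence)
healthy = population - diseased

true_positives = round(diseased * sensitivity)
false_negatives = diseased - true_positives
false_positives = round(healthy * false_positive_rate)
true_negatives = healthy - false_positives

for label, count in [
    ("disease, positive test", true_positives),
    ("disease, negative test", false_negatives),
    ("no disease, positive test", false_positives),
    ("no disease, negative test", true_negatives),
]:
    print(f"{label:>26}: {count:>6}")

# Share of all positive tests that belong to people who actually have the disease.
share_true_among_positives = true_positives / (true_positives + false_positives)
print(f"share of positives with disease: {share_true_among_positives:.4f}")
```

Under these assumptions the false positives far outnumber the true positives, because the group without the disease is so much larger than the group with it.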
Conditional probability is the probability of an event given some information; the probability is conditioned on that information. Example:
Answer: the conditional probability. We will see why in a moment.
from IPython.display import IFrame
IFrame('https://docs.google.com/presentation/d/e/2PACX-1vRiLsFDsuuT\
_fGEkjNJJ5Yv6MdEkWshYniIDyrzR4F4vN7UkAUgwT-MrhUTy8_gxwyhLv3rTleNScXw\
/embed?start=false&loop=false&delayms=3000', 960, 569)
from IPython.display import IFrame
IFrame('https://docs.google.com/presentation/d/e/2PACX-1vTYqt2\
-0qckaBNAHfug29S4o0IV-tCrPkOp3a01wWsx65iyAmpFX3gI9ROkaZ21Syf77\
xyiIIDrGAgS/embed?start=false&loop=false&delayms=3000', 960, 569)
from IPython.display import IFrame
IFrame('https://docs.google.com/presentation/d/e/2PACX-1vSTI_AHfonqA-\
ww_uTioJOpF_sy8PHvEkaZ1B0ahy-KdKXygejBtQeQpIACZ0xNLnEYCfTbfkSC3Klw/\
embed?start=false&loop=false&delayms=3000', 960, 569)
Assume a patient is picked at random.
Create a function that calculates $P(A \mid B) = \frac{P(A) \cdot P(B\mid A)}{P(B)}$
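The denominator $P(B)$ is usually not given directly. It can be expanded by splitting $B$ according to whether $A$ occurs, which is why the function takes $P(B \mid \text{not } A)$ as an input rather than $P(B)$ itself:

$$
P(B) = P(A)\,P(B \mid A) + P(\text{not } A)\,P(B \mid \text{not } A), \qquad P(\text{not } A) = 1 - P(A)
$$

Substituting this into Bayes' Rule gives

$$
P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(A)\,P(B \mid A) + \bigl(1 - P(A)\bigr)\,P(B \mid \text{not } A)}
$$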
def bayes_rule(pr_a, pr_b_given_a, pr_b_given_not_a):
    """
    Bayes' Rule
    P(A | B) = P(A)P(B|A) / P(B)

    To compute P(B):
    P(B) = P(B, A) + P(B, Not A)
         = P(A)P(B|A) + P(Not A)P(B | Not A)
    """
    pr_b = pr_a * pr_b_given_a + (1 - pr_a) * pr_b_given_not_a
    return (pr_a * pr_b_given_a) / pr_b
Use bayes_rule to calculate the probability for the original medical question.
pr_disease = 1/1000
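# The survey question gives only the false positive rate; a 99% true positive rate is assumed here.
pr_pos_given_disease = 0.99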
pr_pos_given_no_disease = 0.05
bayes_rule(pr_disease, pr_pos_given_disease, pr_pos_given_no_disease)
0.019434628975265017
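One way to check this result is by simulation. The sketch below draws a large hypothetical population, assigns disease status and test results at random using the same three probabilities, and then looks only at the patients who tested positive. The population size, the seed, and the names `has_disease` and `tests_positive` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n_patients = 1_000_000           # assumption: simulated population size

pr_disease = 1 / 1000
pr_pos_given_disease = 0.99
pr_pos_given_no_disease = 0.05

# Randomly assign disease status according to the prevalence.
has_disease = rng.random(n_patients) < pr_disease

# Each patient's chance of a positive test depends on their disease status.
pr_positive = np.where(has_disease, pr_pos_given_disease, pr_pos_given_no_disease)
tests_positive = rng.random(n_patients) < pr_positive

# Among patients who tested positive, the proportion who have the disease.
simulated_posterior = has_disease[tests_positive].mean()
print(simulated_posterior)
```

With a million simulated patients the estimate should land close to the exact value of about 0.0194, with some random variation if the seed is changed.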
How does the conditional probability change when the prior is larger?
# updating with a subjective prior of 1%
pr_disease_update = 10/1000
pr_pos_given_disease = 0.99
pr_pos_given_no_disease = 0.05
bayes_rule(pr_disease_update, pr_pos_given_disease, pr_pos_given_no_disease)
0.16666666666666669
# updating with a subjective prior of 10%
pr_disease_update2 = 100/1000
pr_pos_given_disease = 0.99
pr_pos_given_no_disease = 0.05
bayes_rule(pr_disease_update2, pr_pos_given_disease, pr_pos_given_no_disease)
0.6875
Notice how quickly the posterior probability climbs as the prior probability increases: from about 1.9% at a prior of 0.1%, to about 16.7% at a prior of 1%, to 68.75% at a prior of 10%.
pr_disease = np.arange(1,999)/1000
pr_pos_given_disease = 0.99
pr_pos_given_no_disease = 0.05
post = bayes_rule(pr_disease, pr_pos_given_disease, pr_pos_given_no_disease)
Table().with_columns(
    "Prior Pr(Disease)", pr_disease,
    "Posterior Pr(Disease | Pos. Test)", post
).iplot("Prior Pr(Disease)")